Reproducible reports with Quarto and RMarkdown

Julia Schulte-Cloos

January 25, 2023

👋 Welcome 🙋

Overview

Part I: Reproducible Research

Why should you care about reproducible research?

🙌 Benefits yourself! 🙌

‘Create a better relationship with your future self’

Why should you care about reproducible research?

🚀 That’s the future of the social sciences! 🚀

Reproducibility vs. replication? 🤔

Replicability refers to situations in which a researcher obtains new data to reach the same scientific conclusions as a previous study, whereas reproducibility refers to situations in which the original researcher’s software, code, and data are used to regenerate the results.

Replication standards: guidelines, protocols, and software designed to help researchers share, analyze, archive, preserve, distribute, catalog, translate, verify, and replicate scholarly research data and analyses across disciplines. Includes proposals to improve the norms around data sharing and replication in scientific research.

What hinders reproducible research and what can facilitate it?


Obstacles 🚧

  • Infrastructure and research habits
  • Hardware requirements
  • Operating systems
  • Versions of software and libraries

Solutions ✨

  • Optimised workflows (integrating coding, authoring, version control)
  • Virtual machines for computationally demanding analyses
  • Containerisation

Part II: Literate Programming and Executable Reports

Literate Programming

Communication via code 🗣️💻

Integrate computer code with software documentation in a single document

Minimal requirements of high-quality code? 👆

  • executes what it is supposed to execute
  • runs without defects or problems, and not only under some circumstances
  • easy to read, maintain, and extend

Good practices 😇

  • directory structure
  • relative paths: read.csv('./data/foo.csv')
  • compile documents in clean software sessions
  • do not set a working directory (or only globally, at the very beginning of a script) → documents should be self-contained and portable
  • attach session information with sessionInfo()
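
A minimal sketch of two of these practices: building file paths from components and recording session information. The folder `demo-project` and the file `session_info.txt` are hypothetical names used only for illustration; a temporary directory stands in for a real project folder.

```r
# Stand-in for the project folder (assumption: a throwaway temp directory)
project_root <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a small example file so the sketch is self-contained
data_file <- file.path(project_root, "data", "foo.csv")
write.csv(head(cars), data_file, row.names = FALSE)

# Build the path from components instead of hardcoding an absolute path
foo <- read.csv(data_file)

# Record the software environment alongside the results
session_log <- capture.output(sessionInfo())
writeLines(session_log, file.path(project_root, "session_info.txt"))
```

In a real report, calling sessionInfo() in the last chunk would print the same information directly into the rendered document.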

How to design a well-structured project directory?

  • use a naming convention that is…
    • human readable: directory names that are easy to understand for you & someone not familiar with the naming convention
    • machine readable: avoid spaces
    • supports sorting: file names that list in a meaningful order
  • directory names that contain components of the project and can be referenced in the code (e.g. figs, data, etc.)
- ./data
    + raw_data.csv
    + tidy_data.csv
    + codebook.txt
- ./analysis
- ./figures
    + interaction_plot.png
    + bar_plot.png
- ./paper
- ./presentation
- ./README.md
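
A directory skeleton like the one above could be created from R. This is only a sketch: the project name `my-project` is hypothetical, and a temporary directory is used so that nothing is written into a real project.

```r
# Hypothetical project root inside a temp directory
project_root <- file.path(tempdir(), "my-project")

# Sub-directories taken from the structure shown above
subdirs <- c("data", "analysis", "figures", "paper", "presentation")
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Empty README as a placeholder for the project description
file.create(file.path(project_root, "README.md"))

# List the top level of the new project
created <- list.files(project_root)
print(created)
```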

Getting started: Markdown, RMarkdown, and Quarto I

  • Markdown as a human readable way to style text

  • “Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).” John Gruber, creator of Markdown

  • R and RStudio (RStudio is not the only IDE that supports RMarkdown; Visual Studio Code is also a great choice)

  • RMarkdown integrates R code into Markdown language through knitr

  • Quarto: extension of RMarkdown, optimised for language interoperability & CLI

Getting started: Markdown, RMarkdown, and Quarto II

Getting started: Pandoc and Lua I

  • Pandoc:
    • extremely powerful open-source document conversion tool
    • allows for conversion between different (40+) markup languages
    • conversion e.g., between docx, HTML, LaTeX, and Markdown
  • Lua filters:
    • manipulate the Pandoc Abstract Syntax Tree (AST) between the parsing and writing phases
    • powerful collection of Pandoc Lua filters available open-source
    • extremely useful to adjust the standard RMarkdown framework for scientific use cases (e.g., blinded version of manuscripts, several bibliographies, etc.)

Getting started: Pandoc and Lua II

Markdown basics I

Text formatting and emphasis

  • bold text can be created with **bold text** or equivalently __bold__
  • italic text can be created with *italic* or equivalently _italic_

Markdown basics II

Sections

  • # A level-one section

  • ## A level-two section with a [link](/url)

  • # An unnumbered section {-}, or equivalently # An unnumbered section {.unnumbered}

  • always include a blank line before a header
  • sections can be labelled and referenced by including an attribute after the header: {#sec:introduction}
  • if you do not specify a section id, Pandoc will automatically create one, e.g. # Reproducible research outputs {#reproducible-research-outputs}.

Markdown basics III

Lists

Bullet list

  • Bullet 1
  • Bullet 2
    • Sub-bullet 1
    • Sub-bullet 2

Numbered lists

  1. Point 1
  2. Point 2
    2.1. First sub-point
    2.2. Second sub-point

Markdown basics IV

Footnotes

Writing in “source” vs. “visual” mode

…mostly a matter of taste 🍷🍺

Narrative text & code integration I

Code chunks

Control how code and its products appear in your compiled report or manuscript. Chunk labels are optional, but each label must be unique within a document, e.g. {r data2017-tidy}

Chunk options

Define conditions under which the code is evaluated and how its output is processed within the document. The most frequent options include: eval, include, results, echo. Comprehensive lists are available online, in the RMarkdown reference guide, and in the Quarto documentation. Most IDEs allow you to easily switch between different chunks.

Narrative text & code integration II

→ old-school way to specify chunk options

```{r elephant-chunk-1, out.width="20%", fig.align="center", fig.cap="Elephant in the room", echo="fenced"}
knitr::include_graphics(path = "figs/elephant.jpg")
```

[Figure: Elephant in the room]

→ more recently, chunk options can be specified in a YAML-style within the actual code chunk, for better readability

```{r elephant-chunk-2}
#| out-width: '20%'
#| fig-align: 'center'
#| fig-cap: 'Elephant in the room'
#| fig-alt: 'A pink elephant portrayed in a room'
knitr::include_graphics(path = "figs/elephant.jpg")
```

[Figure: Elephant in the room]

Narrative text & code integration III

Referencing actual results

```{r lm-cars-eval}
# A simple linear regression model
fit <- lm(dist ~ speed, data = cars)
```
The slope of the regression is
`r round(fit$coefficients[2], digits = 2)`.

The slope of the regression is 3.93.

YAML header

  • sets global parameters of document
  • E.g. output, title, author, date
  • YAML (“YAML Ain’t Markup Language”) is a data serialisation syntax
  • key-value pairs separated by colons
  • indentation is critical!
---
title: "Writing a reproducible research paper"
author: "Julia Schulte-Cloos"
date: 2023-01-25
format: 
  pdf: 
    execute: 
      echo: false
---

In doubt about YAML validity? Use an available YAML linter.

Parameterized reports

You can render your document using parameters specified globally in the YAML header. These parameters affect how your code is evaluated, e.g. by focussing only on a subset of your data.

---
title: "My Document"
params:
  alpha: 0.1
  ratio: 0.1
---
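
Inside a code chunk, the values are available through the `params` list. The sketch below defines `params` by hand as a stand-in for what the renderer provides. How the two values are used is an assumption made for illustration: `alpha` as a significance threshold and `ratio` as the share of rows to keep.

```r
# Stand-in for the `params` list supplied when the document is rendered
params <- list(alpha = 0.1, ratio = 0.1)

# Assumption: `ratio` is the share of rows of the data to keep
n_keep <- round(params$ratio * nrow(cars))
cars_subset <- cars[seq_len(n_keep), ]

# Assumption: `alpha` is the significance threshold for the slope
fit <- lm(dist ~ speed, data = cars)
p_value <- summary(fit)$coefficients["speed", "Pr(>|t|)"]
significant <- p_value < params$alpha
```

Changing a value in the YAML header then changes the analysis without touching the code itself.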

Lab Session I 👩‍💻 👨‍💻

Create a reproducible document that…

  • includes a title and your authoring information
  • features a footnote and an image
  • features some real literate programming (e.g., printing some calculation within the written text)
  • can be rendered both to HTML and PDF

Part III: Authoring with Quarto

Layout

Table of contents

  • inclusion after title page
  • add parameter in YAML header
---
format: 
  pdf: 
    toc: true
---

Paragraphs and indentation

  • Pandoc option indent: true in the YAML header

Page margins and spacing

  • geometry option in the YAML header

Bibliography and citations

Bib-files and citations

Include your literature.bib file in the YAML header (YAML key: bibliography:). Cite any entry as recorded in the .bib-file with @palmerdata.2020 for in-text citations and [@palmerdata.2020, p. 10] for parenthetical citations.

CSL (Citation Styles)

If a CSL style is specified, Pandoc converts Markdown references, i.e., @palmerdata.2020, to ‘hardcoded’ text and to a hyperlink to the reference section in your document.

Biblatex / Natbib

If your document specifies a citation package like biblatex or natbib along with the related options, Pandoc will create the corresponding LaTeX commands (e.g. \autocite or \citep) from the Markdown references. This is not recommended because it ties you to LaTeX-based output formats.

Cross-references

Cross-reference sections, figures, tables or equations: e.g., @fig-elephant.

With colorlinks: true option in the YAML header, hyperlinks are colored

Section labels

If you do not specify a section label, Pandoc will automatically assign a label based on the title of your header. For more details, see the Pandoc manual. If you wish to add a manual label to a header, add {#mylabel} to the end of the section header.

Figures

Markdown syntax for figures

![An Elephant](elephant.png){#fig-elephant}

Cross-referencing figures

⚠️ Quarto uses a slightly different syntax to cross-reference figures than RMarkdown: @fig-elephant

Figure Divs

::: {#fig-elephant}

![](elephant.png){width="20%"}

Elephant
:::

Side-by-Side figures

Add a dedicated code chunk option #| layout-ncol: 2 to your code chunks to include several figures side by side.

This is very powerful in conjunction with #| fig-subcap: to specify a caption for each figure.

#| label: fig-graphsidebyside
#| layout-ncol: 2
#| fig-subcap: ["Caption of left figure","Caption of right figure"]

Bibliographic reference list

Simple bibliography list

### References

::: {#refs}
:::

More complicated use cases

  • two or more separate bibliographies (e.g., one for main body, one for appendix)
  • use the Quarto extension section-bibliographies

Lab Session II 👩‍💻 👨‍💻

Create a reproducible document that…

  • adds a bib-file and cites some work out of it
  • features 1.5 line spacing in its PDF version
  • includes cross-references to a regression table (hint: try to work with modelsummary)

🎁 Bonus

Integrate two tables (or figures) side-by-side, each with its own sub-caption in your Quarto document


Part IV: Automatable reports for advanced users

Single-source publishing

  • one document that contains narrative text and code can be rendered to several output formats
  • e.g., blog posts, scientific manuscripts
  • ✌️ benefits? ✌️

Conditional code evaluation

Advantages? 🤔

Approach 1

  • add an option to a code chunk
  • e.g., #| eval: !expr knitr::is_html_output() (the !expr tag tells the knitr engine to evaluate the value as R code)
  • powerful in conjunction with conditional content inclusion (see next slide)
  • allows you to exclude entire sections (including all relevant code chunks and text)

Approach 2

  • specify the execute YAML option
  • global execute options (no indentation, for any type of format)
  • format specific option (indentation, specific for each format)
---
format: 
  html:
    toc: true
    code-fold: true
    execute: 
      echo: true
  pdf: 
    toc: false
    execute: 
      echo: false
execute: 
    warning: false
    message: false
---

Conditional content

  • show content only in certain formats with the .content-visible class
  • hide content in certain formats with the .content-hidden class
::: {.content-hidden unless-format="pdf"}

Will only appear in PDF.

:::

Code chunks: reference labels

Reference labels of code chunks

The code chunk option ref.label takes a vector of chunk labels to retrieve the content of the respective chunks.

Use case: adding all code to the Appendix of a manuscript

ref.label can also evaluate R code, e.g. to retrieve the code of all labels within a document (knitr::all_labels()).

# Appendix: All code for this presentation
#| ref.label: !expr knitr::all_labels()

…or a subset of chunks that are also evaluated when rendering the document:

#| ref.label: !expr knitr::all_labels(eval == TRUE)

Quarto for scholarly writing I

Language interoperability

  • integration of narrative text and code in multiple programming languages (e.g., R, Python, Julia)
  • in addition to the knitr engine, the Jupyter engine can be used
---
title: "My Document"
jupyter: python3
---

Quarto for scholarly writing II

Quarto offers more control over author-related metadata (names, affiliations, contributions to the work), which is printed as part of the title block in some output formats. See the full Quarto documentation for details.

---
author:
  - name:
      given: Norah
      family: Jones
      literal: Norah Jones
    attributes:
      corresponding: true
    orcid: 0000-1234-0000-5678  
---      

Quarto for scholarly writing III

Quarto extensions

Quarto formats

  • format extensions enable you to add new formats to the built-in formats (e.g. html, pdf, docx)
  • use case: provide default document options, style-sheets, header, footer, or logo elements

Lab Session III 👩‍💻 👨‍💻

Create a reproducible Quarto document that…

  • includes a dedicated title block with all your information and all information of your co-author(s), including ORCID ☝️ Hint: you might want to use a Quarto format
  • prints the code in the HTML output, while not printing the code in the PDF

🎁 Bonus 1

Add some content that should be excluded depending on the output format (HTML blog post vs. PDF manuscript).

🎁 Bonus 2

Add a filter that allows you to draft your abstract as part of the main text (rather than in the YAML meta data).
